From Surveys to Populations

GVPT399F: Power, Politics, and Data

Surveys

  • Collecting data on an entire population is very difficult

    • Even the census misses people!
  • Happily, we can survey a sample of our population and use it to learn about the population as a whole

  • However, our ability to do this depends on how good our sample is

  • What do I mean by “good”?

The 2024 US Presidential Election

  • Elections are preceded by a flood of surveys

Surveys

  • Surveys are conducted on a subset (sample) of the population of interest

  • Our population of interest: individuals who voted in the 2024 US Presidential Election

A good sample

  • A good sample is a representative one

  • How closely does our sample reflect our population?

Parallel worlds

  • Think back to our last session on experiments

  • In an ideal world, we would be able to create two parallel worlds (one with the treatment, one held as our control)

    • One version of the election booth run without monitors (the control)

    • One version with monitors (the treatment)

  • These worlds are perfectly identical to each other prior to treatment

  • We cannot do this :(

The next best thing

  • Our next best option is to create two groups that are as identical to one another as possible prior to treatment

  • If they are (almost) identical, differences between their group-wide outcomes can be attributed to the treatment

  • One good way of getting two (almost) identical groups is to assign individuals to those groups randomly

    • Think back to our 1,000 hypothetical people!
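The idea above can be sketched in R with hypothetical data: randomly assign 1,000 people to two groups, then check that a pre-treatment trait (here, a made-up age variable) looks (almost) identical across the groups.

```r
# A minimal sketch with hypothetical data
set.seed(399)

people <- data.frame(
  id  = 1:1000,
  age = sample(18:80, 1000, replace = TRUE)  # a pre-treatment trait
)

# Flip a (fair) coin for each person
people$group <- sample(c("treatment", "control"), 1000, replace = TRUE)

# On average, the two groups should be (almost) identical prior to treatment
tapply(people$age, people$group, mean)
```

Because assignment is random, any single pre-treatment difference between the group averages is small and due to chance alone.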

Randomization

  • Randomization continues to pop its chaotic head up

  • We can use it to create a sample that is (almost) identical to our population, on average

  • Drawing randomly from our population increases our chances of ending up with a sample that reflects that population

  • This is what we call a representative sample

Random sampling

  • All individuals in the population need to have an equal chance of being selected for the sample

    • If this holds, you have a pure random sample
  • This is really hard to do!

    • How likely are you to answer a pollster’s call from an unknown number in the middle of the day?

    • Even if you did answer, how likely were you to answer all their questions?
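A quick sketch of why unequal selection probabilities break a pure random sample, using an invented scenario where older people are more likely to answer the pollster’s call:

```r
# Hypothetical population and response probabilities
set.seed(399)
age <- sample(18:80, 100000, replace = TRUE)

# Older people answer more often (an assumed relationship, for illustration)
p_answer <- (age - 17) / 63

respondents <- age[runif(100000) < p_answer]

mean(age)          # population average age
mean(respondents)  # respondents skew older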

To illustrate

Countries’ GDP in 2022: [plot of countries’ GDP values]

Countries’ GDP

I want to estimate the average GDP across all countries in 2022.

  • I send out a survey to all countries’ Departments of Statistics and ask for their GDP figures for 2022.

  • I get 60 responses:

# Simulate the 60 responses: draw 60 countries at random from those
# with a reported GDP figure
sample_df <- gdp_df |> 
  drop_na(sample_value) |> 
  sample_n(size = 60) |> 
  transmute(country, gdp = sample_value)

sample_df
# A tibble: 60 × 2
   country              gdp
   <chr>              <dbl>
 1 Finland          2.80e11
 2 Uganda           4.56e10
 3 Papua New Guinea 3.16e10
 4 Sweden           5.80e11
 5 Cambodia         4.00e10
 6 Germany          4.16e12
 7 Angola           1.04e11
 8 Singapore        4.98e11
 9 Indonesia        1.32e12
10 Cote d'Ivoire    7.02e10
# ℹ 50 more rows

Countries’ GDP

I now calculate the average of these responses, which I find to be:

sample_df |> 
  summarise(avg_gdp = scales::dollar(mean(gdp, na.rm = TRUE)))
# A tibble: 1 × 1
  avg_gdp         
  <chr>           
1 $828,117,470,747

Now, imagine that we knew the true average across all countries definitively, and it was far lower than this estimate. Why such a large difference?

Non-response bias

Poorer countries are far less likely to be able or willing to provide these economic data to academics or international organizations.

  • They tend to be underrepresented in a lot of data

My sample was biased against poorer countries.

  • They were not equally likely to respond to my request for data as rich countries
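This bias can be simulated with invented, GDP-like numbers: if the probability of responding rises with GDP, the sample average overstates the truth.

```r
# A minimal sketch of non-response bias (hypothetical numbers)
set.seed(399)
gdp <- rlnorm(195, meanlog = 24, sdlog = 2)  # skewed, GDP-like values

# Richer countries are more likely to reply (assumed for illustration)
p_respond <- rank(gdp) / length(gdp)

responded <- gdp[runif(195) < p_respond]

scales::dollar(mean(gdp))        # the true average
scales::dollar(mean(responded))  # the biased (higher) sample average
```

Poor countries drop out of the sample more often, so the respondents’ average sits above the population average.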

Large numbers

  • Randomization isn’t enough: we also need to draw a sufficiently large sample from our population

    • One person pulled randomly from the class isn’t going to be very reflective of the class!
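A quick sketch (again with an invented population) showing how larger random samples produce estimates that cluster more tightly around the population mean:

```r
# Hypothetical population
set.seed(399)
population <- rnorm(100000, mean = 50, sd = 10)

# 1,000 estimates each from samples of size 1 and size 100
one_person <- replicate(1000, mean(sample(population, size = 1)))
hundred    <- replicate(1000, mean(sample(population, size = 100)))

sd(one_person)  # estimates bounce around a lot
sd(hundred)     # much more stable
```

One randomly drawn person is still an unbiased pick, but any single draw can land far from the population average; a sample of 100 rarely does.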